1. Introduction¶
This analysis is part of the Udacity Data Analysis Nanodegree Program and explores a dataset of medical appointment no-shows in Brazil. The analysis is divided into four main parts:
- Introduction, where the initial info is provided and the problem is set.
- Data Wrangling, where the data is cleaned and prepared for analysis.
- Exploratory Data Analysis, where key patterns are to be found.
- Conclusions, in which the insights are described.
Dataset Description¶
This dataset contains over 100K medical appointments made in Brazil. The appointments belong to the Brazilian public health system, known as SUS (Sistema Único de Saúde, Unified Health System), one of the largest health systems in the world, entirely free of charge and used by over 220 million Brazilians.
The dataset has 14 columns:
- PatientID, Identification of a patient.
- AppointmentID, Identification of each appointment.
- Gender, "M" for Male and "F" for Female.
- ScheduledDay, the day the patient set up the appointment.
- AppointmentDay, the day of the actual appointment.
- Age, age in years.
- Neighbourhood, the location of the appointment.
- Scholarship, whether the patient is enrolled in Bolsa Família.
- Hipertension, True or False.
- Diabetes, True or False.
- Alcoholism, True or False.
- Handcap, True or False.
- SMS_received, True or False.
- No-show, Categorical type, if the appointment was a no-show or not.
Question(s) for Analysis¶
- What is the mean time between the scheduled day and the appointment day?
- Is there a direct relation between this time difference and the no-shows?
- Is SMS a key factor to reduce no-shows?
- Regarding location, which neighbourhood has the highest no-show rate?
Data Wrangling¶
In this section of the report, the data will be loaded in, checked for cleanliness, and cleaned as necessary.
# import necessary packages for analysis and data visualization
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'notebook'
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64
 2   Gender          110527 non-null  object
 3   ScheduledDay    110527 non-null  object
 4   AppointmentDay  110527 non-null  object
 5   Age             110527 non-null  int64
 6   Neighbourhood   110527 non-null  object
 7   Scholarship     110527 non-null  int64
 8   Hipertension    110527 non-null  int64
 9   Diabetes        110527 non-null  int64
 10  Alcoholism      110527 non-null  int64
 11  Handcap         110527 non-null  int64
 12  SMS_received    110527 non-null  int64
 13  No-show         110527 non-null  object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
df.duplicated().value_counts()
False    110527
dtype: int64
As seen in the previous outputs, there are no missing values and no duplicated rows.
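The same checks can be expressed directly as counts; a minimal sketch on toy data (the notebook runs the equivalent calls on the full appointments dataframe):

```python
import pandas as pd

# Toy dataframe with no missing values and one fully duplicated row
toy = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', 'y', 'y']})

print(toy.isna().sum().sum())  # total count of missing values -> 0
print(toy.duplicated().sum())  # count of fully duplicated rows -> 1
```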
# Change column names to lower case
df.rename(columns=str.lower, inplace=True)
# Rename incorrect column names
df.rename(columns={'hipertension': 'hypertension',
                   'handcap': 'handicap',
                   'no-show': 'noshow'},
          inplace=True)
Categorical Data¶
PatientId¶
The column is loaded as float64, but it represents a unique patient identifier, so there is no reason to apply numerical operations to it.
# Change the dtype of PatientId to string
# The PatientId is changed to int first, just to remove the decimals and have a cleaner look
df['patientid'] = df['patientid'].astype('int64').astype(str)
# Check the number of unique patient IDs
df.patientid.nunique()
62299
The lower number of unique patient IDs indicates that many patients made more than one appointment.
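The repeat-patient pattern can be illustrated with a small sketch (hypothetical IDs; the notebook's `patientid` column behaves the same way):

```python
import pandas as pd

# Hypothetical patient IDs: 5 appointments made by 3 distinct patients
ids = pd.Series(['111', '222', '222', '333', '222'])

print(ids.nunique())             # distinct patients -> 3
print(ids.value_counts().max())  # appointments for the busiest patient -> 3
```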
Appointment ID¶
As with PatientId, this column should be stored as a string.
# Change the dtype of AppointmentID
df['appointmentid'] = df['appointmentid'].astype(str)
df['appointmentid'].nunique()
110527
This column contains only unique values, generated by a system to uniquely identify appointments. For the analysis, it will be used as the index, giving the index more clarity and meaning.
df.set_index('appointmentid', drop=True, inplace=True)
Gender and No-Show¶
Both will be checked for inconsistencies and have their data types changed to categorical for memory efficiency.
# Check the distinct values
df.gender.unique()
array(['F', 'M'], dtype=object)
# Changing the dtype to category
df['gender'] = df['gender'].astype('category')
df.noshow.unique()
array(['No', 'Yes'], dtype=object)
# Changing the dtype to category
df['noshow'] = df['noshow'].astype('category')
Neighbourhood¶
df.neighbourhood.unique()
array(['JARDIM DA PENHA', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
'REPÚBLICA', 'GOIABEIRAS', 'ANDORINHAS', 'CONQUISTA',
'NOVA PALESTINA', 'DA PENHA', 'TABUAZEIRO', 'BENTO FERREIRA',
'SÃO PEDRO', 'SANTA MARTHA', 'SÃO CRISTÓVÃO', 'MARUÍPE',
'GRANDE VITÓRIA', 'SÃO BENEDITO', 'ILHA DAS CAIEIRAS',
'SANTO ANDRÉ', 'SOLON BORGES', 'BONFIM', 'JARDIM CAMBURI',
'MARIA ORTIZ', 'JABOUR', 'ANTÔNIO HONÓRIO', 'RESISTÊNCIA',
'ILHA DE SANTA MARIA', 'JUCUTUQUARA', 'MONTE BELO',
'MÁRIO CYPRESTE', 'SANTO ANTÔNIO', 'BELA VISTA', 'PRAIA DO SUÁ',
'SANTA HELENA', 'ITARARÉ', 'INHANGUETÁ', 'UNIVERSITÁRIO',
'SÃO JOSÉ', 'REDENÇÃO', 'SANTA CLARA', 'CENTRO', 'PARQUE MOSCOSO',
'DO MOSCOSO', 'SANTOS DUMONT', 'CARATOÍRA', 'ARIOVALDO FAVALESSA',
'ILHA DO FRADE', 'GURIGICA', 'JOANA D´ARC', 'CONSOLAÇÃO',
'PRAIA DO CANTO', 'BOA VISTA', 'MORADA DE CAMBURI', 'SANTA LUÍZA',
'SANTA LÚCIA', 'BARRO VERMELHO', 'ESTRELINHA', 'FORTE SÃO JOÃO',
'FONTE GRANDE', 'ENSEADA DO SUÁ', 'SANTOS REIS', 'PIEDADE',
'JESUS DE NAZARETH', 'SANTA TEREZA', 'CRUZAMENTO',
'ILHA DO PRÍNCIPE', 'ROMÃO', 'COMDUSA', 'SANTA CECÍLIA',
'VILA RUBIM', 'DE LOURDES', 'DO QUADRO', 'DO CABRAL', 'HORTO',
'SEGURANÇA DO LAR', 'ILHA DO BOI', 'FRADINHOS', 'NAZARETH',
'AEROPORTO', 'ILHAS OCEÂNICAS DE TRINDADE', 'PARQUE INDUSTRIAL'],
dtype=object)
df.neighbourhood.nunique()
81
The Neighbourhood column has 81 unique values and there is no need to change anything.
Datetime data¶
Both the AppointmentDay and ScheduledDay columns should be in datetime format.
# Change both columns to datetime and normalize them to keep just the date information
df['appointmentday']= pd.to_datetime(df['appointmentday']).dt.normalize()
df['scheduledday'] = pd.to_datetime(df['scheduledday']).dt.normalize()
# Create a month column for easy data manipulation
df['appointmentday_month'] = df['appointmentday'].dt.month_name()
df['scheduledday_month'] = df['scheduledday'].dt.month_name()
It can be presumed that every ScheduledDay comes before its AppointmentDay.
To check this, a new column will be created and filtered on this information.
# Create a new column 'BetweenDays'
df['betweendays'] = df['appointmentday'] - df['scheduledday']
# Create a column with the value as float for easy data manipulation
df['betweendays_float'] = df['betweendays'].dt.total_seconds() / (24 * 60 * 60)
# Check for possible inconsistencies
df['betweendays'].describe()
count                         110527
mean     10 days 04:24:31.828602965
std      15 days 06:07:11.673762786
min                -6 days +00:00:00
25%                  0 days 00:00:00
50%                  4 days 00:00:00
75%                 15 days 00:00:00
max                179 days 00:00:00
Name: betweendays, dtype: object
# Check how many rows have inconsistent data
df[df['betweendays'].dt.days < 0].shape
(5, 17)
Since there are only 5 rows with invalid data, they will be dropped from the dataframe.
df.drop(
df[df['betweendays'].dt.days < 0].index,
inplace=True
)
Let's transform the float values in betweendays_float into a categorical type. This will help further analysis by grouping the data.
def categorical_betweendays(row):
    if row['betweendays_float'] == 0:
        return '0 Days'
    elif row['betweendays_float'] <= 3:
        return '1-3 Days'
    elif row['betweendays_float'] <= 7:
        return '4-7 Days'
    elif row['betweendays_float'] <= 14:
        return '8-14 Days'
    elif row['betweendays_float'] <= 30:
        return '15-30 Days'
    else:
        return '30+ Days'

df['betweendays_cate'] = df.apply(categorical_betweendays, axis=1)
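An equivalent, vectorized version of this binning can be written with `pd.cut`; a sketch on hypothetical day-difference values, with bin edges mirroring the `<=` boundaries of the row-wise function:

```python
import pandas as pd

# Hypothetical day differences, mirroring betweendays_float
s = pd.Series([0.0, 2.0, 5.0, 10.0, 20.0, 45.0])

# Right-closed intervals: (-0.5, 0], (0, 3], (3, 7], (7, 14], (14, 30], (30, inf]
bins = [-0.5, 0, 3, 7, 14, 30, float('inf')]
labels = ['0 Days', '1-3 Days', '4-7 Days', '8-14 Days', '15-30 Days', '30+ Days']

print(list(pd.cut(s, bins=bins, labels=labels)))
# -> ['0 Days', '1-3 Days', '4-7 Days', '8-14 Days', '15-30 Days', '30+ Days']
```

`pd.cut` is much faster than `df.apply(..., axis=1)` on a 110K-row frame, though it returns a categorical dtype rather than plain strings.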
Numeric data¶
For the numeric data, the columns 'scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap' and 'sms_received' are expected to contain only the integers 0 and 1.
df[['scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received']].describe()
| scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | |
|---|---|---|---|---|---|---|
| count | 110522.000000 | 110522.000000 | 110522.000000 | 110522.000000 | 110522.000000 | 110522.000000 |
| mean | 0.098270 | 0.197255 | 0.071868 | 0.030401 | 0.022231 | 0.321040 |
| std | 0.297681 | 0.397928 | 0.258270 | 0.171689 | 0.161493 | 0.466878 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
As the table shows, all columns contain only zeros and ones, except handicap, whose max value is four.
As stated in this discussion on Kaggle, that value means the number of disabilities a person has. For example, for a person who is blind and has difficulty walking, the value would be 2.
With this in mind, we will check the number of such values and leave the column as it is.
print(
f"""
Number of Appoints where Handicap > 1: {df[df.handicap > 1].shape[0]}
Percentual: {df[df.handicap > 1].shape[0] / df.shape[0] * 100:.3f} %
"""
)
Number of Appoints where Handicap > 1: 199
Percentual: 0.180 %
Only approximately 0.2% of the data has more than one disability.
Age column¶
For the Age column, we need to check if there are any inconsistent values.
df.age.describe()
count    110522.000000
mean         37.089041
std          23.110064
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: age, dtype: float64
There are outliers, as we see negative values and values over 100.
The negative values have no meaning, so they will be dropped.
As for the values over 100, although they are hard to believe, we will check the number of cases and see how they impact the data.
# Check for ages lower than 0
df[df.age < 0]
| patientid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | noshow | appointmentday_month | scheduledday_month | betweendays | betweendays_float | betweendays_cate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| appointmentid | ||||||||||||||||||
| 5775010 | 465943158731293 | F | 2016-06-06 00:00:00+00:00 | 2016-06-06 00:00:00+00:00 | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No | June | June | 0 days | 0.0 | 0 Days |
# Drop that row
df.drop(
df[df.age < 0].index,
inplace= True
)
df[df.age > 100]
| patientid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | noshow | appointmentday_month | scheduledday_month | betweendays | betweendays_float | betweendays_cate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| appointmentid | ||||||||||||||||||
| 5651757 | 976294799775439 | F | 2016-05-03 00:00:00+00:00 | 2016-05-03 00:00:00+00:00 | 102 | CONQUISTA | 0 | 0 | 0 | 0 | 0 | 0 | No | May | May | 0 days | 0.0 | 0 Days |
| 5700278 | 31963211613981 | F | 2016-05-16 00:00:00+00:00 | 2016-05-19 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | Yes | May | May | 3 days | 3.0 | 4-7 Days |
| 5700279 | 31963211613981 | F | 2016-05-16 00:00:00+00:00 | 2016-05-19 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | Yes | May | May | 3 days | 3.0 | 4-7 Days |
| 5562812 | 31963211613981 | F | 2016-04-08 00:00:00+00:00 | 2016-05-16 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | Yes | May | April | 38 days | 38.0 | 30+ Days |
| 5744037 | 31963211613981 | F | 2016-05-30 00:00:00+00:00 | 2016-05-30 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | No | May | May | 0 days | 0.0 | 0 Days |
| 5751563 | 234283596548 | F | 2016-05-31 00:00:00+00:00 | 2016-06-02 00:00:00+00:00 | 102 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 0 | No | June | May | 2 days | 2.0 | 1-3 Days |
| 5717451 | 748234579244724 | F | 2016-05-19 00:00:00+00:00 | 2016-06-03 00:00:00+00:00 | 115 | SÃO JOSÉ | 0 | 1 | 0 | 0 | 0 | 1 | No | June | May | 15 days | 15.0 | 15-30 Days |
For the data with age over 100, there are only a few cases, just 7.
Looking at the PatientId, 4 of those appointments were made by the same patient in the span of 3 months.
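That observation can be double-checked with a quick count; a sketch over the seven IDs shown in the table above:

```python
import pandas as pd

# The patientid values from the seven age > 100 rows
over100_ids = pd.Series(['976294799775439', '31963211613981',
                         '31963211613981', '31963211613981',
                         '31963211613981', '234283596548',
                         '748234579244724'])

print(over100_ids.nunique())                    # distinct patients -> 4
print((over100_ids == '31963211613981').sum())  # appointments for one patient -> 4
```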
Exploratory Data Analysis¶
As a preparation step for the further analysis, a helper function will be defined to explore the columns.
def get_statistcs(df, col_name, height=400, width=400):
    """
    Prints summary statistics for a given column of a pandas dataframe
    and displays a basic violin plot.
    Args:
        df: pandas DataFrame.
        col_name: str, column name (numeric dtype only).
        height: int, graph height. Default = 400.
        width: int, graph width. Default = 400.
    Returns:
        None; prints the statistics and shows the violin plot.
    """
    print(f"""
Mean: {df[col_name].mean():.3f}
Std deviation: {np.std(df[col_name]):.3f}
Max value: {df[col_name].max()}
Min value: {df[col_name].min()}
Median: {df[col_name].median()}
Number of unique values: {df[col_name].nunique()}
""")
    fig = px.violin(df, y=col_name, box=True, height=height, width=width, points='outliers')
    fig.show()
1. What is the mean time between the scheduled day and the appointment day?¶
get_statistcs(df, 'betweendays_float')
Mean: 10.184
Std deviation: 15.255
Max value: 179.00000000000003
Min value: 0.0
Median: 4.0
Number of unique values: 129
# Find the betweendays mean
print('---')
print(f'Mean time between schedule and appointment:\n{df.betweendays.mean()}')
print('---')
# Plot a graph of this mean for each month
df_fig = pd.DataFrame(df.groupby('appointmentday_month').betweendays_float.mean()).reset_index()
fig1 = px.bar(
df_fig,
x = 'appointmentday_month',
y = 'betweendays_float',
height= 400,
width= 500,
title='Mean days between Schedule and Appointment',
text_auto='.2f'
)
fig1.show()
---
Mean time between schedule and appointment:
10 days 04:25:27.412889858
---
As the statistics and graph show, the mean time between the schedule and appointment days is about 10 days. The standard deviation is about 15 days, meaning the data is widely dispersed. That also suggests the data does not follow a normal distribution.
There is no significant variation over time.
1.1. Is there a direct relation between this time difference and the no-shows?¶
# Group data by noshow
df_fig = df.groupby('noshow')['betweendays_float'].mean().reset_index()
fig = px.bar(
df_fig,
x = 'noshow',
y = 'betweendays_float',
height= 400,
width= 500,
title='Mean days between by noshow',
text_auto='.2f'
)
fig.show()
As the graph above suggests, there is a relation between no-shows and the number of days between schedule and appointment.
# Group data by betweendays_cate and noshow
df_fig = df.groupby(['betweendays_cate','noshow']).patientid.count().reset_index()
df_fig = df_fig.pivot(
index='betweendays_cate',
columns='noshow',
values='patientid'
).reset_index()
df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
df_fig.sort_values(by='noshow_rate', inplace=True)
fig = px.bar(
df_fig,
x = 'betweendays_cate',
y = 'noshow_rate',
# color = 'noshow',
# barmode='group',
height= 400,
width= 500,
title='Noshow rate over between days',
text_auto='.2f'
)
fig.show()
From the graph, it can be seen that the no-show rate increases the longer the time between scheduling and the scheduled day. The lowest value is reached when the appointment occurs on the same day it is scheduled, where the no-show rate is less than 5%. In this case, the remaining no-shows may be associated with emergencies.
2. Is SMS a key factor to reduce no-shows?¶
get_statistcs(df, 'sms_received')
Mean: 0.321
Std deviation: 0.467
Max value: 1
Min value: 0
Median: 0.0
Number of unique values: 2
The plot above is not a meaningful visualization, as the SMS data is better interpreted as categorical. The relevant value is the mean, which shows the frequency of receiving an SMS: around 30%.
Therefore, it is interesting to analyze whether there is any relationship with location, as SMS may not be sent in some locations.
df_fig = df.groupby(['neighbourhood', 'sms_received']).patientid.count().reset_index()
df_fig1 = df_fig.pivot(
index='neighbourhood',
columns='sms_received',
values='patientid'
).reset_index()
df_fig1['sms_received_rate'] =( df_fig1[1] / (df_fig1[1] + df_fig1[0])) * 100
fig = px.bar(
df_fig1,
x = 'neighbourhood',
y = 'sms_received_rate',
height= 400,
width= 800,
title='SMS received rate by neighbourhood',
text_auto='.1f'
)
fig.show()
get_statistcs(df_fig1, 'sms_received_rate')
Mean: 33.081
Std deviation: 6.388
Max value: 60.0
Min value: 12.5
Median: 33.33333333333333
Number of unique values: 79
df_fig1.loc[df_fig1['neighbourhood'].isin(['ILHAS OCEÂNICAS DE TRINDADE', 'PARQUE INDUSTRIAL'])]
| sms_received | neighbourhood | 0 | 1 | sms_received_rate |
|---|---|---|---|---|
| 34 | ILHAS OCEÂNICAS DE TRINDADE | 2.0 | NaN | NaN |
| 51 | PARQUE INDUSTRIAL | 1.0 | NaN | NaN |
With the analyses and graphs above, it is possible to see that there are 2 outliers whose sms_received_rate is missing. This happens because there is not enough data for these neighbourhoods: they have only 3 entries between them in the original dataframe.
Despite this, when analyzing the proportion across all neighbourhoods, the distribution is close to normal, with a mean of 33% and a standard deviation of 6%.
There is no big difference between neighbourhoods and, apart from the outliers, there is no impediment to continuing the analysis.
Now let's look at the no-show rate in relation to sms_received
df_fig = df.groupby(['sms_received', 'noshow']).patientid.count().reset_index()
df_fig = df_fig.pivot(
index='sms_received',
columns='noshow',
values='patientid'
).reset_index()
df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
fig = px.bar(
df_fig,
x = 'sms_received',
y = 'noshow_rate',
height= 400,
width= 500,
title='Noshow rate by sms_received',
text_auto='.1f'
)
fig.show()
The previous result differs from what was expected. An SMS reminder should be a mechanism to avoid no-shows, yet the highest no-show rate appears when an SMS is received.
Now let's segment this result by the days between scheduling and appointment to investigate further.
df_fig = df.groupby(['betweendays_cate','sms_received', 'noshow']).patientid.count().reset_index()
df_fig = df_fig.pivot(
index=['betweendays_cate','sms_received'],
columns='noshow',
values='patientid'
).reset_index()
df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
df_fig['sms_received'] = df_fig['sms_received'].astype(str)
fig2 = px.bar(
df_fig,
x = 'betweendays_cate',
y = 'noshow_rate',
color='sms_received',
barmode='group',
title='No show rate by sms_received over days between',
text_auto=True
)
fig2.show()
When segmenting the data by the between-days categories, the result is as expected.
The previous value was probably biased because no SMS is sent for same-day appointments or for appointments 1 to 3 days ahead.
With the analysis above, it can be seen that SMS contributes to reducing the no-show rate.
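The pivot-and-divide pattern used repeatedly above can be wrapped in a small helper. A sketch on toy data, where `noshow_rate_by` is a hypothetical name, not part of the notebook:

```python
import pandas as pd

def noshow_rate_by(df, cols):
    """No-show rate (%) grouped by one or more columns."""
    counts = df.groupby(cols)['noshow'].value_counts().unstack(fill_value=0)
    return counts['Yes'] / (counts['Yes'] + counts['No']) * 100

# Toy data: among SMS receivers, 1 of 4 appointments is a no-show;
# among non-receivers, 1 of 2
toy = pd.DataFrame({
    'sms_received': [0, 0, 1, 1, 1, 1],
    'noshow':       ['No', 'Yes', 'No', 'No', 'No', 'Yes'],
})

rates = noshow_rate_by(toy, 'sms_received')
print(rates)  # sms_received 0 -> 50.0, 1 -> 25.0
```

The same helper would cover the later neighbourhood grouping as well, avoiding three near-identical pivot blocks.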
3. Regarding location, which neighbourhood has the highest no-show rate?¶
get_statistcs(df_fig, 'noshow_rate')
Mean: 27.401
Std deviation: 8.945
Max value: 37.525201612903224
Min value: 4.647061874384108
Median: 28.956443683457362
Number of unique values: 10
# Group data by neighbourhood and noshow
df_fig = df.groupby(['neighbourhood', 'noshow']).patientid.count().reset_index()
df_fig = df_fig.pivot(
index='neighbourhood',
columns='noshow',
values='patientid'
).reset_index()
# Filter out neighbourhoods with a low volume of data
df_fig = df_fig.where(df_fig['Yes'] + df_fig['No'] > 10)
# Create noshow_rate column
df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
# Sort values and get the top 10
df_fig1 = df_fig.sort_values(by='noshow_rate', ascending=False).head(10)
fig3 = px.bar(
df_fig1,
x = 'neighbourhood',
y = 'noshow_rate',
height= 400,
width= 500,
title='Top 10 neighbourhood by noshow rate',
text_auto='.1f'
)
fig3.show()
df_fig2 = df_fig.sort_values(by='noshow_rate', ascending=False).tail(10)
fig = px.bar(
df_fig2,
x = 'neighbourhood',
y = 'noshow_rate',
height= 400,
width= 500,
title='Bottom 10 neighbourhood by noshow rate',
text_auto='.1f'
)
fig.show()
get_statistcs(df_fig, 'noshow_rate')
Mean: 19.887
Std deviation: 3.080
Max value: 28.91849529780564
Min value: 8.571428571428571
Median: 19.758848697005057
Number of unique values: 77
When analyzing the no-show rate by neighbourhood, some regions show a higher rate than others. However, the statistics and the violin plot show little dispersion, with a distribution closer to normal.
Conclusions¶
This analysis sought to explore data from medical appointments made through the Unified Health System (SUS) in Brazil. The focus of the analysis is understanding why roughly 20% of appointments end in a no-show. This high volume of no-shows reveals an inefficiency in the scheduling system, and also an operational inefficiency, since maintaining these indicators represents a large financial cost.
The data was treated and explored as described above with the purpose of understanding the possible causes of no-shows, but statistically conclusive statements cannot be made. The dataset has a number of limitations in how its fields are documented, and a more in-depth study, taking other factors into account, would be necessary to draw actionable conclusions.
About analysis limitations¶
- Concerning data scope
The data relates only to the municipality of Vitória (ES). As SUS is a nationwide system, analyzing data from just one region gives a limited and not very comprehensive picture.
The data also covers a period of just 3 months. For a system of the magnitude of SUS, a larger time window would be more relevant for analyzing seasonality and changes over time.
- Concerning data quality
Regarding the quality of the data, a detailed explanation of the fields was not available, so in some scenarios conclusions had to be inferred. An example is the case of values greater than 1 in the handicap column, as explained previously.
The sms_received data also appears to be incomplete: there is no information about sending attempts and, based on the analysis, when the period between scheduling and appointment is short (up to 3 days), no SMS is sent or received.
About the questions made¶
1. What is the mean time between Scheduled day and the appointment day?¶
fig1.show()
As the graph above shows, the average is 10 days and 4 hours. This value does not change much over the months, and no significant seasonal pattern was identified.
2. Is SMS a key factor to reduce no-shows?¶
fig2.show()
The graph above shows that yes, receiving an SMS is a significant factor in reducing no-shows. The visualization also shows that no SMS is sent when the time gap between scheduling and appointment is short, as there is no data for those categories.
3. Regarding location, which neighbourhood has the highest no-show rate?¶
fig3.show()
As the graph above shows, the neighbourhoods that have the highest noshow rates are:
- SANTOS DUMONT
- SANTA CECÍLIA
- SANTA CLARA
- ITARARÉ
- JESUS DE NAZARETH
# Running this cell will execute a bash command to convert this notebook to an .html file
# !python -m nbconvert --to html Investigate_a_Dataset.ipynb